Analyzing Shakespeare

1.) To get a first idea, run a quick analysis on the text using Unix tools such as wc and grep to answer the following questions:

a: wc Shakespeare.txt: 124456 lines, 901325 words, 5458199 bytes

b: grep -c -i "by William Shakespeare" Shakespeare.txt: 38
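For readers without the Unix tools at hand, the two measurements can be sketched in Python. This only approximates wc and grep -c -i, and it runs on a small in-memory sample rather than on Shakespeare.txt; the function names wc_counts and count_matching_lines and the sample text are made up for illustration.

```python
def wc_counts(data: bytes) -> tuple[int, int, int]:
    """Return (lines, words, bytes) in the order wc prints them."""
    lines = data.count(b"\n")   # wc counts newline characters
    words = len(data.split())   # whitespace-separated tokens
    return lines, words, len(data)


def count_matching_lines(text: str, pattern: str) -> int:
    """Count lines containing a fixed string, ignoring case (like grep -c -i)."""
    needle = pattern.lower()
    return sum(needle in line.lower() for line in text.splitlines())


# Hypothetical sample standing in for the real file.
sample = "THE SONNETS\nby William Shakespeare\n\nHAMLET\nBy William Shakespeare\n"
print(wc_counts(sample.encode("utf-8")))
print(count_matching_lines(sample, "by William Shakespeare"))
```

Note that grep -c counts matching lines, not individual matches, which is why the sketch tests each line once.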

2.)

The first part of the pipeline searches for all occurrences of "by William Shakespeare" in Shakespeare.txt. The option -B 6 additionally includes the 6 lines before each matching line. The output of the first part is then processed by three further stages:

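The pattern ^$ is passed to grep with the option -e; it matches empty lines, which are excluded by inverting the match (grep -v). The command tr '\n' ' ' replaces all newlines with a space. The command sed 's/ -- /\n/g' replaces every double dash surrounded by spaces (the group separator that grep prints when -B is used) with a newline.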

The execution involves 4 processes, one per command in the pipeline.
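The pipeline itself is not reproduced here, so the following Python sketch reconstructs it from the description above. The exact command (assumed to be roughly grep -B 6 "by William Shakespeare" Shakespeare.txt | grep -v -e '^$' | tr '\n' ' ' | sed 's/ -- /\n/g') and the helper names grep_before and pipeline are assumptions. The sketch mimics each of the four stages on a list of lines to show why the result is one output line per match group.

```python
def grep_before(lines, pattern, before):
    """Mimic grep -B: keep each matching line plus `before` lines of leading
    context, with a "--" line between non-adjacent groups."""
    keep = set()
    for i, line in enumerate(lines):
        if pattern in line:
            keep.update(range(max(0, i - before), i + 1))
    out, prev = [], None
    for i in sorted(keep):
        if prev is not None and i > prev + 1:
            out.append("--")          # group separator, as printed by grep
        out.append(lines[i])
        prev = i
    return out


def pipeline(lines, pattern="by William Shakespeare", before=6):
    stage1 = grep_before(lines, pattern, before)   # grep -B 6 PATTERN
    stage2 = [l for l in stage1 if l != ""]        # grep -v -e '^$'
    stage3 = "".join(l + " " for l in stage2)      # tr '\n' ' '
    return stage3.split(" -- ")                    # sed 's/ -- /\n/g'


# Hypothetical sample; before=2 keeps the example short.
demo = ["THE SONNETS", "", "by William Shakespeare", "text", "text", "text",
        "HAMLET", "", "by William Shakespeare"]
for entry in pipeline(demo, before=2):
    print(entry)
```

The "--" separators are not empty lines, so they survive the second stage and end up surrounded by spaces after tr, which is exactly what the sed expression splits on.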

3.)

a) monotonically_increasing_id => filter

b) filter(... isin copyright)

c) withColumn when contains "by William Shakespeare" lag...

d+e) repartition => groupBy => agg

4.)

Preparations

a)

b)

c)

split by title

d+e)

additional steps

5.)

6.)

7.)

Stopwords are common words that carry little meaning in a text. Therefore they are often removed during text mining. A collection of stopword lists is available in this repo: https://github.com/stopwords-iso/stopwords-en/tree/master/raw

Obviously the stopword lexicon wasn't sensitive enough: "thy" is a word that should be removed as well. Research suggests that a stopword lexicon for Middle English might be useful. The Python package cltk provides several for older languages... Otherwise there seems to be a small issue with Hamlet.
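To illustrate the point about "thy", here is a minimal sketch of stopword removal with an extended list. The tiny BASE_STOPWORDS set stands in for a list loaded from the repo above, ARCHAIC_STOPWORDS is a hand-picked, hypothetical addition rather than the cltk lexicon, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter

# Stand-in for a modern English stopword list (assumption: the real list
# would be loaded from a file in the stopwords-iso repo).
BASE_STOPWORDS = {"the", "and", "to", "of", "is", "in", "not"}
# Hand-picked archaic forms (hypothetical; not taken from cltk).
ARCHAIC_STOPWORDS = {"thy", "thou", "thee", "thine"}


def top_words(text, stopwords, n=3):
    """Lowercase, split into alphabetic tokens, drop stopwords, count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords).most_common(n)


# Made-up sample line, not a quotation from the corpus.
line = "To thine own self be true, and thy love to thy friend is thy crown"
print(top_words(line, BASE_STOPWORDS))                      # "thy" still ranks first
print(top_words(line, BASE_STOPWORDS | ARCHAIC_STOPWORDS))  # archaic forms removed
```

Keeping the archaic words in a separate set makes it easy to compare results with and without them.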

No more thy!

Extra: Plotting top-ten word counts for the top ten plays